feat: Implement PDF Report Streaming from Storage by felixjordandev · Pull Request #95 · LumintelAnalytics/ChainReport-API

felixjordandev · 2025-12-14T10:40:46Z

Overview

This PR updates the report download functionality to stream pre-generated PDF reports directly from a designated storage directory, rather than generating them on-the-fly from HTML.

Changes

Added a new PDF_REPORT_STORAGE_PATH setting to backend/app/core/config.py to specify the location of stored PDF reports (e.g., storage/reports).
Modified the get_report_pdf endpoint in backend/app/api/v1/report_download.py to use FileResponse to serve existing PDF files.
The endpoint now constructs the PDF file path using the report_id and the new PDF_REPORT_STORAGE_PATH setting.
Removed the dependency on weasyprint and tempfile for PDF generation in report_download.py since files are now streamed directly.

Summary by CodeRabbit

New Features
- PDF reports are now cached and served instantly on subsequent requests, improving load times.
Chores
- Updated directory structure for storing generated reports.
- Removed automatic cleanup of previously generated report files.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-12-14T10:40:54Z

Walkthrough

The changes implement PDF caching for report downloads, storing generated PDFs at a deterministic path based on report_id and serving cached copies without regeneration. A cleanup helper function and related cleanup logic are removed. The reports output directory path is updated from reports_output to storage/reports.

Changes

Cohort / File(s)	Summary
PDF caching and cleanup removal `backend/app/api/v1/report_download.py`	Removed cleanup_file helper function and post-generation cleanup logic. Added conditional check to serve cached PDFs if they exist. Introduced deterministic pdf_filepath based on report_id and ensured output directory exists before generation. Removed temporary file handling. Added logging to indicate whether cached or newly generated PDF is served.
Configuration path update `backend/app/core/config.py`	Updated REPORT_OUTPUT_DIR default path from BASE_DIR / "reports_output" to BASE_DIR / "storage" / "reports".

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Review the caching logic in report_download.py to ensure the conditional check correctly handles both cached and fresh PDF scenarios
Verify that directory creation (ensureDir) is robust and the deterministic path calculation is correct
Confirm that the path change in config.py aligns with project directory structure and any deployment configurations

Possibly related PRs

Feature: Add PDF download endpoint for reports #93: Directly inverse relationship—that PR added cleanup_file helper and background cleanup tasks around PDF generation, while this PR removes that helper/cleanup and replaces temporary file handling with deterministic cached PDF paths.

Poem

🐰 A cache is born, PDFs rest,
No cleanup haunt, just paths that nest,
Storage/reports, organized and spry,
Reports fly fast without a sigh! ✨📄

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: implementing PDF report streaming from storage, which aligns with the removal of on-demand PDF generation and the addition of caching/serving pre-generated PDFs from a designated storage directory.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/stream-pdf-reports

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

backend/app/api/v1/report_download.py (2)
1-1: Remove unused tempfile import.

The PR description states that dependencies on tempfile were removed, and indeed the code no longer uses temporary files. This import should be removed.

Apply this diff:
-import tempfile
 import os
105-105: Remove unused background_tasks parameter.

The background_tasks parameter is declared but never used in the function. Since PDFs are now cached indefinitely without cleanup, this parameter serves no purpose.

Apply this diff:
 async def get_report_pdf(
     report_id: str,
-    background_tasks: BackgroundTasks,
     db_session: AsyncSession = Depends(get_db),
And remove the unused import:
-from fastapi import APIRouter, Depends, HTTPException, status, BackgroundTasks
+from fastapi import APIRouter, Depends, HTTPException, status

🧹 Nitpick comments (1)

backend/app/api/v1/report_download.py (1)

135-146: No cleanup strategy for accumulated PDFs.

PDFs are cached indefinitely without any cleanup mechanism. Over time, this will consume increasing amounts of disk space, especially if reports are frequently generated or if old reports are no longer needed.

Consider implementing one of the following strategies:

LRU cache with size limits: Remove least recently accessed PDFs when storage reaches a threshold

Time-based expiration: Delete PDFs older than a configured retention period (e.g., 30 days)

On-demand regeneration: Add a query parameter or separate endpoint to force regeneration, allowing clients to refresh stale reports

You might also want to add monitoring/alerting for disk usage on the REPORT_OUTPUT_DIR volume.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4468bb8 and 961aba6.

📒 Files selected for processing (2)

backend/app/api/v1/report_download.py (1 hunks)
backend/app/core/config.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.14.8)

backend/app/core/config.py

54-54: Undefined name Path

(F821)

coderabbitai · 2025-12-14T10:42:44Z

backend/app/api/v1/report_download.py

+    if pdf_filepath.exists():
+        logger.info(f"Serving existing PDF report for report_id: {report_id}")
+        return FileResponse(
+            path=pdf_filepath,
+            media_type="application/pdf",
+            filename=f"report_{report_id}.pdf",
+            status_code=status.HTTP_200_OK
+        )
+
+    logger.info(f"Generating new PDF report for report_id: {report_id}")
    html_content = render_report_html(final_report_json)

-    pdf_path = None
    try:
        # Offload blocking PDF generation to a thread pool
        loop = asyncio.get_event_loop()
-        pdf_file = await loop.run_in_executor(
-            None,  # Use default thread pool executor
-            lambda: tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
-        )
-        pdf_path = pdf_file.name
        await loop.run_in_executor(
            None,
-            lambda: HTML(string=html_content).write_pdf(pdf_path)
+            lambda: HTML(string=html_content).write_pdf(pdf_filepath)
        )


⚠️ Potential issue | 🟠 Major

Race condition: concurrent PDF generation for the same report.

When multiple requests arrive simultaneously for a report that doesn't have a cached PDF, all requests will pass the pdf_filepath.exists() check and attempt to generate the PDF concurrently. This can lead to:

Corrupted PDF files due to simultaneous writes

Wasted compute resources

Inconsistent user experience

Consider using file-based locking or an in-memory lock (e.g., using asyncio.Lock stored in a dictionary keyed by report_id) to ensure only one request generates the PDF at a time, while others wait for the generation to complete.

Example approach using asyncio locks:

# At module level _pdf_generation_locks: dict[str, asyncio.Lock] = {} # In get_report_pdf function, before checking if PDF exists: if report_id not in _pdf_generation_locks: _pdf_generation_locks[report_id] = asyncio.Lock() async with _pdf_generation_locks[report_id]: # Re-check existence after acquiring lock if pdf_filepath.exists(): logger.info(f"Serving existing PDF report for report_id: {report_id}") return FileResponse(...) # Generate PDF...

🤖 Prompt for AI Agents

In backend/app/api/v1/report_download.py around lines 139 to 157, multiple concurrent requests can pass the pdf_filepath.exists() check and start PDF generation simultaneously; add a per-report asyncio lock to serialize generation: create a module-level dict mapping report_id to asyncio.Lock, acquire the lock before re-checking pdf_filepath.exists(), return the FileResponse if rediscovered, otherwise generate the PDF inside the lock (using the existing executor call), then release the lock and optionally remove the lock from the dict to avoid growth; this ensures only one coroutine writes the PDF while others wait and then serve the cached file.

coderabbitai · 2025-12-14T10:42:45Z

backend/app/core/config.py


    BASE_DIR: Path = Path(__file__).parent.parent.parent
-    REPORT_OUTPUT_DIR: Path = BASE_DIR / "reports_output"
+    REPORT_OUTPUT_DIR: Path = BASE_DIR / "storage" / "reports"


⚠️ Potential issue | 🔴 Critical

Critical: Missing import for Path.

The Path class is used on lines 53-54 but is not imported. This will cause a NameError at runtime when the Settings class is instantiated.

Add the missing import at the top of the file:

from pydantic_settings import BaseSettings, SettingsConfigDict from typing import Dict, List +from pathlib import Path

🧰 Tools

🪛 Ruff (0.14.8)

54-54: Undefined name Path

(F821)

🤖 Prompt for AI Agents

In backend/app/core/config.py around line 54, the Settings class uses Path (REPORT_OUTPUT_DIR) but Path is not imported, causing a NameError; add the missing import by importing Path from the pathlib module at the top of the file (for example: from pathlib import Path) so the REPORT_OUTPUT_DIR assignment resolves correctly.

feat: Add PDF report streaming from storage

961aba6

coderabbitai bot reviewed Dec 14, 2025

View reviewed changes

klingonaston approved these changes Dec 14, 2025

View reviewed changes

klingonaston merged commit b71ada6 into main Dec 14, 2025
1 check passed

klingonaston deleted the feat/stream-pdf-reports branch December 14, 2025 12:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement PDF Report Streaming from Storage#95

feat: Implement PDF Report Streaming from Storage#95
klingonaston merged 1 commit intomainfrom
feat/stream-pdf-reports

felixjordandev commented Dec 14, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 14, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Dec 14, 2025

Uh oh!

coderabbitai bot Dec 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

felixjordandev commented Dec 14, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Changes

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

felixjordandev commented Dec 14, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 14, 2025 •

edited

Loading